Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them
Authors
Abstract
Data stream classification plays an important role in modern data analysis, where data arrive in a stream and must be mined in real time. In the data stream setting, the underlying distribution that generates the data may change and evolve, so classifiers that can update themselves during operation are becoming the state of the art. In this paper we show that data streams may have an important temporal component, which is currently not considered in the evaluation and benchmarking of data stream classifiers. We demonstrate that a naive classifier that considers only the temporal component outperforms many current state-of-the-art classifiers on real data streams with temporal dependence, i.e., streams whose data are autocorrelated. We propose evaluating data stream classifiers with temporal dependence taken into account, and we introduce a new evaluation measure that provides a more accurate gauge of data stream classifier performance. In response to the temporal dependence issue, we propose a generic wrapper for data stream classifiers that incorporates the temporal component into the attribute space.
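As an illustration only, the two ideas in the abstract can be sketched in Python. The abstract does not spell out either construction, so this sketch rests on two assumptions: that the naive temporal classifier simply predicts the previous instance's label, and that the wrapper appends the most recent true labels to each attribute vector. The names `no_change_accuracy`, `majority_class_accuracy`, and `add_previous_labels`, and the parameters `order` and `missing`, are hypothetical and do not come from the paper.

```python
from collections import Counter, deque
from typing import Hashable, Iterable, Iterator, List, Sequence, Tuple


def no_change_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy of predicting that each label equals the previous one.

    Assumption: this stands in for the "naive classifier considering the
    temporal component only". The first instance has no predecessor and
    is skipped.
    """
    if len(labels) < 2:
        raise ValueError("need at least two labels")
    hits = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return hits / (len(labels) - 1)


def majority_class_accuracy(labels: Sequence[Hashable]) -> float:
    """Test-then-train accuracy of predicting the most frequent label so far.

    Included only as a second trivial baseline for comparison.
    """
    if len(labels) < 2:
        raise ValueError("need at least two labels")
    counts: Counter = Counter([labels[0]])
    hits = 0
    for y in labels[1:]:
        prediction = counts.most_common(1)[0][0]
        hits += prediction == y
        counts[y] += 1  # learn the true label only after predicting
    return hits / (len(labels) - 1)


def add_previous_labels(
    stream: Iterable[Tuple[Sequence, Hashable]],
    order: int = 1,
    missing: Hashable = None,
) -> Iterator[Tuple[List, Hashable]]:
    """Yield (x + previous labels, y) for each (x, y) in the stream.

    Assumption: the temporal component is represented by the `order` most
    recent true labels, oldest first; `missing` pads the start of the stream.
    """
    history = deque([missing] * order, maxlen=order)
    for x, y in stream:
        yield list(x) + list(history), y
        history.append(y)  # the label becomes available after prediction


if __name__ == "__main__":
    labels = ["up"] * 5 + ["down"] * 5  # strongly autocorrelated toy stream
    print(no_change_accuracy(labels), majority_class_accuracy(labels))
```

On an autocorrelated label sequence such as the toy stream above, the no-change baseline scores far higher than the majority-class baseline, which is the effect the abstract warns about. Any incremental classifier could be trained on the output of `add_previous_labels` in place of the raw stream.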
Similar resources
Detecting Concept Drift in Data Stream Using Semi-Supervised Classification
A data stream is a sequence of data generated by various information sources at high speed and in high volume. Classifying data streams faces three challenges: unlimited length, online processing, and concept drift. In related research, the challenge of unlimited stream length is commonly met by dividing the stream into fixed-size windows or by gradual forgetting. Concept drift refer...
Solving Problems with CP: Four Common Pitfalls to Avoid
Constraint Programming (CP) is a general technique for solving combinatorial optimization problems. Real-world problems are quite complex, and solving them requires dividing the work into different parts, mainly: the abstraction of interesting and relevant subparts, the definition of benchmarks and the design of a global model, and the application of a particular search strategy. We propose t...
Diagnostic pitfalls of pilomatricoma on fine needle aspiration cytology
Pilomatricoma is a benign skin adnexal tumour usually seen in the head and neck region of children and young adults. It is underrecognized on cytology, resulting in the overdiagnosis of malignancy. We bring forth a case report of a slow growing nodular swelling in a 10-year-old female child, which was misdiagnosed on Fine Needle Aspiration Cytology (FNAC) as a malignant neoplasm and found to be...
TARGETING CUSTOMERS: A FUZZY CLASSIFICATION APPROACH
Nowadays, marketing serves the purpose of maximizing customer lifetime value (CLV) and customer equity, which is the sum of the lifetime values of the company’s customers. However, CLV calculation encounters some difficulties that limit the use of this technique. Nonetheless, companies are looking for methods to calculate their customers’ CLV. In this paper, fuzzy classification rules we...
Nursing information system in hospitals of Zahedan city, Iran
Purpose: Nurses can directly influence the quality and outcomes of health services. They need correct and timely information to provide effective service to patients and other healthcare personnel. A nursing information system is a subsystem of the hospital information system that can help nurses perform better. The goal of this study was to evaluate nursing information system of hospit...